In this exercise, we will be using functions from the tidyverse package. You can see we’ve added the chunk option message = FALSE to hide the version information that tidyverse normally displays.

library(tidyverse)

(a) Read in the ice core data

The file icecore.csv contains CO2 concentration measurements made in ice cores from Antarctica over time, where time is defined as the age of the air in years before 2008. That is, an air age of 0 would mean 2008, while an air age of -10008 would be 8000 BC.

Read it into a data frame called icecore.

Hints:

  • You’ll need to insert a code chunk and then add code inside that chunk.
  • Check back with Exercise 1.2 if you can’t remember how to read a CSV file.
  • You can hide the list of columns by changing {r} at the start of your code chunk to {r, message = FALSE}.
icecore <- read_csv("icecore.csv")
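
If you’d like to check that a file has been read in as expected, a sketch along the following lines may help. It writes a tiny made-up CSV to a temporary file, reads it with read_csv(), and inspects the result. The values and core names are invented for illustration; only the column names follow the exercise. The show_col_types = FALSE argument is another way to silence the column listing, assuming readr 2.0 or later.

```r
library(readr)

# Made-up stand-in for icecore.csv: values and core names are invented
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "core,air_age_AD,CO2_ppm",
  "Core A,-1000,280.1",
  "Core A,-500,279.4",
  "Core B,-200000,230.5"
), tmp)

# show_col_types = FALSE suppresses the column specification message
toy_icecore <- read_csv(tmp, show_col_types = FALSE)

# Quick checks: number of rows/columns and the guessed column types
dim(toy_icecore)
spec(toy_icecore)
```

The same checks can be run on the real icecore data frame once it has been read in.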

(b) Make a scatter plot

Use ggplot to make a simple scatter plot of the air_age_AD and CO2_ppm variables in the icecore data frame. (Think: which of these is more appropriate for the x axis and which is more appropriate for the y axis?)

Hint: you will need to specify the data frame to plot, provide mappings from columns to aesthetic attributes, and add a geom layer.

Look at the axis labels. Is there anything about them you don’t understand or don’t like?

ggplot(icecore, aes(x = air_age_AD, y = CO2_ppm)) +
  geom_point()

The x axis is labelled in scientific notation; -4e+05 means -4 × 10^5, that is, -400,000. We’ll show you tomorrow how to change this into something more readable.
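
As a rough preview only, one possible approach (not necessarily the one used in the course) is to pass a labelling function from the scales package to the x scale. The data below are made up; only the column names follow the exercise.

```r
library(ggplot2)

# Made-up data: values are invented, column names follow the exercise
toy <- data.frame(
  air_age_AD = c(-400000, -200000, -10000, 0),
  CO2_ppm    = c(275, 230, 265, 280)
)

# label_comma() writes numbers out in full with thousands separators,
# so the axis no longer falls back to scientific notation
p_comma <- ggplot(toy, aes(x = air_age_AD, y = CO2_ppm)) +
  geom_point() +
  scale_x_continuous(labels = scales::label_comma())

p_comma

# The labelling function can also be tried on its own
scales::label_comma()(-4e+05)
```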

(c) Show the different core samples separately

The icecore dataset includes samples from multiple ice cores, which are recorded in the core variable. Use an appropriate aesthetic (e.g. colour or shape) to display the cores separately.

ggplot(icecore, aes(x = air_age_AD, y = CO2_ppm, colour = core)) +
  geom_point()
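
The shape aesthetic mentioned in the question works the same way as colour. Here is a minimal sketch using made-up data (the values and the core names "Core A" and "Core B" are invented):

```r
library(ggplot2)

# Made-up data: two hypothetical cores with three samples each
toy <- data.frame(
  core       = rep(c("Core A", "Core B"), each = 3),
  air_age_AD = c(-3000, -2000, -1000, -2500, -1500, -500),
  CO2_ppm    = c(270, 275, 280, 268, 277, 282)
)

# Mapping core to shape gives each core its own point symbol
p_shape <- ggplot(toy, aes(x = air_age_AD, y = CO2_ppm, shape = core)) +
  geom_point()

p_shape
```

Shape tends to be harder to read than colour when there are many overlapping points, so it may suit small datasets or black-and-white printing better.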

(d) Make a line graph

Make a copy of your code above and change it so it produces a line graph. Try making a line graph with and without points.

ggplot(icecore, aes(x = air_age_AD, y = CO2_ppm, colour = core)) +
  geom_point() +
  geom_line()
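
For the version without points, the geom_point() layer is simply left out. A sketch with made-up data (values and core names are invented):

```r
library(ggplot2)

# Made-up data: two hypothetical cores with three samples each
toy <- data.frame(
  core       = rep(c("Core A", "Core B"), each = 3),
  air_age_AD = c(-3000, -2000, -1000, -2500, -1500, -500),
  CO2_ppm    = c(270, 275, 280, 268, 277, 282)
)

# Only a line layer: one line per core, because colour = core
# also splits the data into groups
p_line <- ggplot(toy, aes(x = air_age_AD, y = CO2_ppm, colour = core)) +
  geom_line()

p_line
```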

(e) Histogram

Make a histogram of the CO2_ppm variable.

Choose an appropriate width for the bins (binwidth = XXX) and make sure the bins line up on a round number (boundary = XXX).

ggplot(icecore, aes(x = CO2_ppm)) +
  geom_histogram(binwidth = 10, boundary = 200)
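
To see what binwidth and boundary actually do, it can help to look at the bin edges ggplot computes. The sketch below uses made-up CO2 values and the layer_data() helper from ggplot2 to list each bin’s limits and count.

```r
library(ggplot2)

# Made-up CO2 concentrations in ppm (invented for illustration)
toy <- data.frame(
  CO2_ppm = c(183, 195, 204, 211, 218, 226, 239, 247, 262, 281)
)

p_hist <- ggplot(toy, aes(x = CO2_ppm)) +
  geom_histogram(binwidth = 10, boundary = 200)

# layer_data() returns the data frame ggplot computed for the layer;
# xmin and xmax are the left and right edges of each bin
bins <- layer_data(p_hist)
bins[, c("xmin", "xmax", "count")]
```

With binwidth = 10 and boundary = 200, every bin edge should fall on a multiple of 10. Any other multiple of 10 used as the boundary would be expected to give the same bins.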

(f) Histogram with facets

Copy and paste your code from the last question, then use the facet_wrap() function to facet the data by core.

What can you say about the distribution of CO2 concentrations? Is it the same in every core?

ggplot(icecore, aes(x = CO2_ppm)) +
  geom_histogram(binwidth = 10, boundary = 200) +
  facet_wrap(vars(core), ncol = 2)

(g) Box plot

Make a box plot showing the distribution of CO2 concentration by core.

Decide whether the box plots should be vertical or horizontal.

Experiment with adding the width = 0.2 option inside geom_boxplot().

ggplot(icecore, aes(x = CO2_ppm, y = core)) +
  geom_boxplot(width = 0.2)
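
The solution above draws horizontal boxes. To compare, vertical boxes can be produced by swapping the x and y mappings, as in this sketch with made-up data (values and core names are invented):

```r
library(ggplot2)

# Made-up data: two hypothetical cores with four samples each
toy <- data.frame(
  core    = rep(c("Core A", "Core B"), each = 4),
  CO2_ppm = c(190, 215, 240, 275, 200, 230, 260, 285)
)

# Categorical variable on x and continuous variable on y: vertical boxes
p_vertical <- ggplot(toy, aes(x = core, y = CO2_ppm)) +
  geom_boxplot(width = 0.2)

p_vertical
```

Horizontal boxes often work better when the category names are long, since the labels do not overlap.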

(h) Make a bar chart

The file afl_grand_finals.csv contains information on every Australian Football League (AFL) grand final played, from 1898 to 2019.

Read it into a data frame called afl_grand_finals and make a bar chart of the variable winner. (Think: should the winning team be displayed on the x or the y axis?)

What do you make of the team “NA”? (What happens when an AFL grand final ends in a draw?)

What order are the teams shown in?

afl_grand_finals <- read_csv("afl_grand_finals.csv")
ggplot(afl_grand_finals, aes(y = winner)) +
  geom_bar()

The “NA” team represents a grand final ending in a draw, which has happened three times in the history of the AFL. When this happened, the match was replayed the following week. This practice was controversial, and the AFL abolished grand final replays in 2016.

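The teams are plotted in alphabetical order from the bottom of the axis upwards, so they read in reverse alphabetical order from top to bottom: “Adelaide” is at the bottom and “Western Bulldogs” is near the top. “NA” is a special value indicating missing data; it is sorted after all other values, so it appears at the very top.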
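
The ordering roughly mirrors how R sorts text values when it treats them as categories. The short sketch below uses a made-up vector of winners (the team names are real, but the sequence is invented) to show the alphabetical order and the position of NA.

```r
# Made-up sequence of winners, including one missing value
winners <- c("Richmond", "Adelaide", NA, "Western Bulldogs", "Carlton")

# Converting to a factor sorts the distinct values alphabetically;
# NA is not a level, which is why it is handled separately
levels(factor(winners))

# Sorting with na.last = TRUE places the missing value after all others
sort(winners, na.last = TRUE)
```

On a discrete y axis the first category is drawn at the bottom, so an alphabetical list ends up reading from the bottom of the plot upwards.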

We will cover how to remove data with missing values in Exercise 2.2.

(i) Extension: scatter plot with a discrete axis

Conventionally, a scatter plot is used to display the relationship between two continuous variables. Sometimes it’s also appropriate to plot points where one axis shows a categorical variable.

Make a scatter plot with year on the x axis and winner on the y axis.

Add a line connecting the points.

How do you think ggplot knows which points to join with lines? (There is actually a complicated set of rules in place, which can be overridden if needed.)

If you’re starting to feel comfortable with ggplot, try using the code on the slides to reverse the order of categories.

ggplot(afl_grand_finals, aes(x = year, y = winner)) +
  geom_point() +
  geom_line() +
  scale_y_discrete(limits = rev)
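
One part of those rules is the group aesthetic: when a discrete variable is mapped, ggplot by default treats each of its values as a separate group, and geom_line() only joins points within a group. The sketch below uses made-up data (hypothetical teams "Team A" to "Team C") to compare the default grouping with an explicit group = 1, which puts every point into a single group.

```r
library(ggplot2)

# Made-up data: hypothetical teams and years
toy <- data.frame(
  year   = 2001:2006,
  winner = c("Team A", "Team B", "Team A", "Team C", "Team B", "Team A")
)

# Default: one group per team, so lines only join wins by the same team
p_default <- ggplot(toy, aes(x = year, y = winner)) +
  geom_point() +
  geom_line()

# Override: a single group, so one line joins all points in year order
p_single <- ggplot(toy, aes(x = year, y = winner, group = 1)) +
  geom_point() +
  geom_line()

# Number of groups seen by the line layer (layer 2) in each plot
length(unique(layer_data(p_default, 2)$group))
length(unique(layer_data(p_single, 2)$group))
```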

(j) Extension: axis labels

If you’ve finished early, go back to the plots you’ve made and make sure they all have appropriate axis labels, using the labs() function demonstrated in the lecture slides.

By default, ggplot uses variable names for axis labels. This is very helpful when you’re making plots for your own consumption, but if you want to share your graphics with others, you would usually want to provide something more descriptive.
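
As a minimal sketch of what this might look like, the code below adds labels to a scatter plot of made-up data. The label wording is only a suggestion and assumes that CO2_ppm is measured in parts per million; adjust it to suit your own plots.

```r
library(ggplot2)

# Made-up data: values are invented, column names follow the exercise
toy <- data.frame(
  air_age_AD = c(-3000, -2000, -1000, 0),
  CO2_ppm    = c(270, 275, 280, 285)
)

# labs() replaces the default variable names with descriptive text
p_labelled <- ggplot(toy, aes(x = air_age_AD, y = CO2_ppm)) +
  geom_point() +
  labs(
    x = "Age of air (years)",
    y = "CO2 concentration (ppm)"
  )

p_labelled
```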


© 2021 Statistical Consulting Centre, The University of Melbourne.